Goto

Collaborating Authors

 kruskal rank


Identification of Multivariate Measurement Error Models

arXiv.org Machine Learning

Multivariate continuous latent variables arise in numerous empirical settings in economics, psychology, marketing, epidemiology, and the social sciences. Examples include multidimensional skills, cognitive factors, latent preferences, health indices, productivity components, and risk attitudes. In practice, such latent constructs are rarely observed directly; instead, researchers rely on multiple imperfect measurements that contain potentially correlated forms of noise. The resulting measurement-error problem is especially severe when the latent variable is multidimensional: each measurement typically captures only a low-dimensional projection of the latent vector, and the noise may exhibit dependence or heterogeneity across measurement channels. As a result, classical approaches to continuous measurement error, which often rely on injectivity, offer limited guidance. This paper develops a identification strategy tailored specifically to multivariate continuous latent variables measured with noise. The key innovation is to combine tools from multi-linear algebra--specifically the uniqueness properties of so-called CP tensor decompositions-- with the multivariate extension of Kotlarski's identity, a powerful deconvolution result based on characteristic functions. The starting point is the observation that third-order cross-moments of three separate measurements form a three-way moment tensor whose CP decomposition is governed by the latent factor loading matrices. By invoking Kruskal's theorem (Kruskal, 1977a), I show that these loadings are generically identifiable even when each measurement matrix is rank-deficient and--even more surprisingly--when the stack of all measurement matrices is non-injective.


Efficient Algorithms for Verifying Kruskal Rank in Sparse Linear Regression and Related Applications

arXiv.org Artificial Intelligence

We present novel algorithmic techniques to efficiently verify the Kruskal rank of matrices that arise in sparse linear regression, tensor decomposition, and latent variable models. Our unified framework combines randomized hashing techniques with dynamic programming strategies, and is applicable in various settings, including binary fields, general finite fields, and integer matrices. In particular, our algorithms achieve a runtime of $\mathcal{O}\left(dk \cdot \left(nM\right)^{\lceil k / 2 \rceil}\right)$ while ensuring high-probability correctness. Our contributions include: A unified framework for verifying Kruskal rank across different algebraic settings; Rigorous runtime and high-probability guarantees that nearly match known lower bounds; Practical implications for identifiability in tensor decompositions and deep learning, particularly for the estimation of noise transition matrices.


Memorization Capacity of Multi-Head Attention in Transformers

arXiv.org Artificial Intelligence

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property. We validate our findings through experiments on synthetic data.


Learning from Multiple Noisy Partial Labelers

arXiv.org Machine Learning

Programmatic weak supervision creates models without hand-labeled training data by combining the outputs of noisy, user-written rules and other heuristic labelers. Existing frameworks make the restrictive assumption that labelers output a single class label. Enabling users to create partial labelers that output subsets of possible class labels would greatly expand the expressivity of programmatic weak supervision. We introduce this capability by defining a probabilistic generative model that can estimate the underlying accuracies of multiple noisy partial labelers without ground truth labels. We prove that this class of models is generically identifiable up to label swapping under mild conditions. We also show how to scale up learning to 100k examples in one minute, a 300X speed up compared to a naive implementation. We evaluate our framework on three text classification and six object classification tasks. On text tasks, adding partial labels increases average accuracy by 9.6 percentage points. On image tasks, we show that partial labels allow us to approach some zero-shot object classification problems with programmatic weak supervision by using class attributes as partial labelers. Our framework is able to achieve accuracy comparable to recent embedding-based zero-shot learning methods using only pre-trained attribute detectors


When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

arXiv.org Machine Learning

Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher order" expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allow for overcomplete models, and require the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allows for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition.